This capstone project will be graded by your peers. This capstone project is worth 70% of your total grade. The project will be completed over the course of 2 weeks. Week 1 submissions will be worth 30% whereas week 2 submissions will be worth 40% of your total grade.
1 / For this week, you will be required to submit the following:
A description of the problem and a discussion of the background. (15 marks) A description of the data and how it will be used to solve the problem. (15 marks)
2 / For the second week, the final deliverables of the project will be:
A link to your Notebook on your Github repository, showing your code. (15 marks) A full report consisting of all of the following components (15 marks):
Introduction where you discuss the business problem and who would be interested in this project.
Data where you describe the data that will be used to solve the problem and the source of the data.
Methodology section which represents the main component of the report where you discuss and describe any exploratory data analysis that you did, any inferential statistical testing that you performed, if any, and what machine learnings were used and why.
Results section where you discuss the results.
Discussion section where you discuss any observations you noted and any recommendations you can make based on the results.
Conclusion section where you conclude the report.
3 / Your choice of a presentation or blogpost. (10 marks)
Public transport in Paris and its suburbs is operated by a company called RATP. We will analyse the stations from the RATP open data website and try to understand the network through geolocation data. The goal is to find geolocated points that may suit the opening of a business, based on public transport availability.
1/ We will get food data from the Foursquare API, count how many food venues there are around each station as an example, and plot the result with Folium.
2/ We will use a clustering algorithm to find the best positions that are accessible through (RATP) public transport, based on their data.
Anyone willing to invest in a new house, or any new business that wishes to be accessible by public transport, will find this analysis useful.
The data is accessible at this link: https://data.ratp.fr/explore/dataset/positions-geographiques-des-stations-du-reseau-ratp/information/?disjunctive.stop_name . It comes from RATP, which operates the transport system of France's main region, and is part of their open data, which is free and accessible.
The dataset itself contains the latitude and longitude, plus the name and address of the stations.
- Latitude and longitude will be used to get specific locations and plot them on a map with Folium
- Latitude and longitude will be used to get information from the Foursquare API
- Latitude and longitude will be used by the algorithm to find the best positions
In all cases, the data needs to be transformed before analysis. The dataset will be reduced and filtered to avoid duplicates, and it may be simplified as well to fit within the maximum number of daily calls allowed by the Foursquare API.
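The cleaning steps described above (splitting the combined coordinates column, then deduplicating) can be sketched on a toy sample; the column names mirror the RATP export, but the values here are made up:

```python
import pandas as pd

# Toy sample mimicking the RATP export: one "Coordinates" column
# holding "lat, lng" strings, with a duplicated station row.
sample = pd.DataFrame({
    "Name": ["Chatelet", "Chatelet", "Bastille"],
    "Coordinates": ["48.8583, 2.3470", "48.8583, 2.3470", "48.8532, 2.3692"],
})

# Split the combined column into numeric Latitude / Longitude columns
coords = sample["Coordinates"].str.split(",", expand=True)
coords.columns = ["Latitude", "Longitude"]
coords = coords.astype("float64")

# Attach the new columns and drop exact duplicates
clean = sample.join(coords).drop_duplicates(subset=["Name", "Coordinates"], keep="first")
print(len(clean))  # 2 rows after removing the duplicate Chatelet entry
```

The same split/join/deduplicate sequence is applied to the real dataset below.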
import requests # library to handle requests
import pandas as pd # library for data analysis
import numpy as np # library to handle data in a vectorized manner
import random # library for random number generation
!pip install geopy
from geopy.geocoders import Nominatim # module to convert an address into latitude and longitude values
# libraries for displaying images
from IPython.display import Image
from IPython.core.display import HTML
# library for transforming a JSON file into a pandas dataframe
from pandas.io.json import json_normalize
! pip install folium==0.5.0
import folium # plotting library
print('Folium installed')
print('Libraries imported.')
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3
def __iter__(self): return 0
# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_8785d818e39141bfab835644b9aef929 = ibm_boto3.client(service_name='s3',
ibm_api_key_id='YOUR_IBM_API_KEY',
ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
config=Config(signature_version='oauth'),
endpoint_url='https://s3.eu-geo.objectstorage.service.networklayer.com')
body = client_8785d818e39141bfab835644b9aef929.get_object(Bucket='projectcoursera-donotdelete-pr-hhhvu4vaoaiaew',Key='positions-geographiques-des-stations-du-reseau-ratp.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
raw_data = pd.read_csv(body, sep=";")
raw_data.head()
body = client_8785d818e39141bfab835644b9aef929.get_object(Bucket='projectcoursera-donotdelete-pr-hhhvu4vaoaiaew',Key='positions-geographiques-des-stations-du-reseau-ratp (1).csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )
raw_data_complete = pd.read_csv(body, sep=";")
raw_data_complete.head()
# Split the "Coordinates" column ("lat, lng") into two new columns
coordinate = raw_data_complete["Coordinates"].str.split(",", expand=True)
coordinate.columns = ['Latitude', 'Longitude']
# Append the new Latitude/Longitude columns to the original dataframe
result = raw_data_complete.join(coordinate)
result
# Convert the split strings to their proper types
result["Latitude"] = pd.to_numeric(result["Latitude"])
result["Longitude"] = pd.to_numeric(result["Longitude"])
result['Name'] = result['Name'].astype(str)
result.dtypes
# Keep the first occurrence of each (Name, Description) pair; keep=False would drop every copy, including the first
result.drop_duplicates(subset=["Name", "Description"], keep="first", inplace=True)
address = 'paris'
geolocator = Nominatim(user_agent="paris_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Paris are {}, {}.'.format(latitude, longitude))
map_of_paris = folium.Map(location=[latitude, longitude], zoom_start=12)
sum_markers = 0
# add markers to map
for lat, lng, name in zip(result['Latitude'], result['Longitude'], result['Name']):
    label = folium.Popup(' {}'.format(name), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=0.3,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_of_paris)
    # Follow the advancement of the process compared to the total number of lines
    sum_markers = sum_markers + 1
    print(sum_markers)
map_of_paris
# @hidden_cell
CLIENT_ID = 'YOUR_CLIENT_ID' # your Foursquare ID
CLIENT_SECRET = 'YOUR_CLIENT_SECRET' # your Foursquare Secret
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN' # your Foursquare Access Token
VERSION = '20180604'
LIMIT = 30
We will use the Foursquare API to explore the different types of venue categories that are available.
categories_url = 'https://api.foursquare.com/v2/venues/categories?client_id={}&client_secret={}&v={}'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION)
# make the GET request
results_request = requests.get(categories_url).json()
len(results_request['response']['categories'])
There are 10 top-level categories and multiple subcategories.
Let's print only the top-level categories and their IDs, and add them to categories_list.
categories_list = []

# Print categories down to max_level and collect (name, id) pairs in categories_list
def print_categories(categories, level=0, max_level=0):
    if level > max_level:
        return
    out = '-' * level
    for category in categories:
        print(out + category['name'] + ' (' + category['id'] + ')')
        print_categories(category['categories'], level + 1, max_level)
        categories_list.append((category['name'], category['id']))

print_categories(results_request['response']['categories'], 0, 0)
Keeping a separate dataset for this process
result_name_duplicate = result.copy()  # copy so later changes don't alter the original dataframe
result_name_duplicate.shape
def get_food_count(ll, radius, categoryId):
    explore_url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&v={}&ll={}&radius={}&categoryId={}'.format(
        CLIENT_ID,
        CLIENT_SECRET,
        VERSION,
        ll,
        radius,
        categoryId)
    # make the GET request and return the total number of matching venues
    return requests.get(explore_url).json()['response']['totalResults']
stations_venues_df = result_name_duplicate.copy()
stations_venues_df.shape
stations_venues_df.head()
for i, row in stations_venues_df.iterrows():
    # use row['Coordinates'] rather than .iloc[i], which breaks if the index is not 0..n-1
    stations_venues_df.loc[i, 'Food Venues'] = get_food_count(row['Coordinates'], radius=150, categoryId='4d4b7105d754a06374d81259')
stations_venues_df.to_csv('stations_venues.csv')
As a side note, I may hit the Foursquare API request limit: the calls stop after a point, but the counts remain usable because we write them to a CSV and then work from that CSV only.
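The checkpointing idea above can be made restartable: skip rows that already have a count, so a rerun after a quota error only fetches the missing ones. A minimal sketch, where `get_count` is a stand-in for the real Foursquare call:

```python
import math
import pandas as pd

def get_count(coordinates):
    # Stand-in for the real Foursquare request; returns a fake count.
    return 5

df = pd.DataFrame({
    "Coordinates": ["48.85,2.34", "48.86,2.35", "48.87,2.36"],
    "Food Venues": [12.0, float("nan"), float("nan")],  # first row already fetched
})

calls = 0
for i, row in df.iterrows():
    if not math.isnan(row["Food Venues"]):
        continue  # already checkpointed, no API call needed
    df.loc[i, "Food Venues"] = get_count(row["Coordinates"])
    calls += 1
    # df.to_csv('stations_venues.csv')  # checkpoint after each call

print(calls)  # 2: only the rows without a count triggered a request
```

Checkpointing after each call means a quota error loses at most the current row's work.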
df_food_counts = pd.read_csv('stations_venues.csv',index_col=0)
df_food_counts.head()
map_of_paris_food_size = folium.Map(location=[48.8, 2.35], zoom_start=8)
#Map based on the value of the food venues to understand areas where the food venues are higher
sum_markers =0
# add markers to map
for lat, lng, name, value in zip(df_food_counts['Latitude'], df_food_counts['Longitude'], df_food_counts['Description'], df_food_counts['Food Venues']):
    label = folium.Popup(' {}'.format(name), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=value * 1.5,  # marker size scales with the number of food venues
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_of_paris_food_size)
The map lets us see the places with more food venues, according to Foursquare.
map_of_paris_food_size
result.head()
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Variable with the Longitude and Latitude
X=result.loc[:,['ID','Latitude','Longitude']]
X.head(10)
K_clusters = range(1, 10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
# Fit on both Latitude and Longitude; score() returns the negative inertia
coords = result[['Latitude', 'Longitude']]
score = [kmeans[i].fit(coords).score(coords) for i in range(len(kmeans))]
# Visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
We see in the graph that the curve flattens after 3 or 4 clusters, which means that beyond this point increasing the number of clusters is not useful.
kmeans = KMeans(n_clusters=4, init='k-means++')  # choosing 4 as the number of clusters
X['cluster_label'] = kmeans.fit_predict(X[X.columns[1:3]])  # fit k-means on the Latitude/Longitude subset of "X" and assign clusters
centers = kmeans.cluster_centers_  # coordinates of cluster centers
labels = kmeans.labels_  # label of each point
X.head(10)
X.plot.scatter(x = 'Latitude', y = 'Longitude', c=labels, s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)
X = X[['ID','cluster_label']]
X.head(5)
clustered_data = result.merge(X, left_on='ID', right_on='ID')
clustered_data.head(5)
centers = kmeans.cluster_centers_
print(centers)
!pip install reverse_geocoder
import reverse_geocoder as rg
coordinates_center = (48.86647257, 2.24221891), (48.90685781, 2.37616476), (48.78778046, 2.33707742),(48.84232863, 2.51082987)
rg.search(coordinates_center)
I first decided to work with a subset of the whole dataset (25,000 lines was too much), but then I decided to do half of the work with the full dataset. That is why the first Folium map shows all data points, while for the second we remove even more duplicates based only on the station name, since one station can have multiple entrances. The goal was also to reduce the number of requests to Foursquare: I got **quota exceeded errors** because I made too many calls.
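Since one station can have several entrances, an alternative to dropping duplicate names is to collapse all entrances into one row per station with averaged coordinates. A sketch on toy values (not the real dataset):

```python
import pandas as pd

# Toy entrance rows: two entrances for "Nation", one for "Opera"
entrances = pd.DataFrame({
    "Name": ["Nation", "Nation", "Opera"],
    "Latitude": [48.8480, 48.8484, 48.8709],
    "Longitude": [2.3958, 2.3962, 2.3318],
})

# One row per station: average the coordinates of all its entrances
stations = entrances.groupby("Name", as_index=False)[["Latitude", "Longitude"]].mean()
print(len(stations))  # 2 stations instead of 3 entrance rows
```

This keeps one representative point per station instead of discarding rows outright, so no station disappears from the analysis.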
This dataset was chosen because it gives a location point for each station; it was also easier to use a CSV than to query the geopy API for specific coordinates.
I ran into issues with the maximum number of API calls: I had to reduce the number of lines (from 25,000 to 5,000), and I couldn't work every day since the quota resets daily. I decided to simplify the process with the steps shown above.
Folium maps were used for exploratory analysis, since this data only contains location points, nothing else.
The k-means algorithm gave us centroids for opening a business in a very efficient way. Even so, we may need a more specific, neighbourhood-level view to find the best locations per neighbourhood: while this process shows k-means working well, the input data may not be ideal, because the resulting positions are based on a large radius of more than 15 km.
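One way to get that finer-grained view is to cluster only the stations within a given radius of a point of interest. A sketch of the distance filter using the haversine formula (the 2 km radius and the station values are arbitrary choices for illustration):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lng1, lat2, lng2):
    # Great-circle distance between two (lat, lng) points in kilometres
    dlat, dlng = radians(lat2 - lat1), radians(lng2 - lng1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlng / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

center = (48.8566, 2.3522)  # central Paris
stations = [("A", 48.8583, 2.3470), ("B", 48.9066, 2.3762), ("C", 48.8532, 2.3692)]

# Keep only stations within 2 km of the point of interest
nearby = [name for name, lat, lng in stations
          if haversine_km(center[0], center[1], lat, lng) <= 2.0]
print(nearby)  # ['A', 'C']
```

Running k-means on such a filtered subset would give centroids that are meaningful at the neighbourhood scale rather than across the whole region.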
After the analysis, I find the results unsettling.
- The open data itself is interesting, but it does not seem exhaustive, since some stations are missing from the dataset
Finally,
- We were able to plot all the stations on the Folium map
- We were able to plot the stations with a higher number of food venues
- We were able to identify, using the k-means algorithm, which points are more suitable for opening a business in terms of public transport accessibility
For the scope of this course assignment we won't go further in this analysis. I believe we went through all the different topics except inferential statistics, which seems difficult considering the type of data we are facing.
Some work remains to be done on the following points:
- Finding more data sources and joining them to build a better base dataset
- Using a professional Foursquare account to get all categories for all location points
- Finding a France-wide dataset to extend this analysis to the whole country rather than one city like Paris
- Separating the stations by mode: bus, metro, RER...
The report will be in pdf format available on github.
!pip install -U notebook-as-pdf
!jupyter-nbconvert --to PDFviaHTML example.ipynb